Automatic Semantic Tagging of Unknown Proper Names
Abstract
Implemented methods for proper name recognition rely on large gazetteers of common proper nouns and a set of heuristic rules (e.g. Mr. as an indicator of a PERSON entity type). Though the performance of current PN recognizers is very high (over 90%), it is important to note that this is by no means a "solved problem". Existing systems perform extremely well on newswire corpora by virtue of the availability of large gazetteers and rule bases designed for specific tasks (e.g. recognition of Organization and Person entity types as specified in recent Message Understanding Conferences, MUC). However, large gazetteers are not available for most languages or for applications other than newswire texts and, in any case, proper nouns are an open class. In this paper we describe a context-based method to assign an entity type to unknown proper names (PNs). Like many others, our system relies on a gazetteer and a set of context-dependent heuristics to classify proper nouns. However, due to the unavailability of large gazetteers in Italian, over 20% of detected PNs cannot be semantically tagged. The algorithm that we propose assigns an entity type to an unknown PN based on the analysis of syntactically and semantically similar contexts already seen in the application corpus. The performance of the algorithm is evaluated not only in terms of precision, following the tradition of MUC conferences, but also in terms of Information Gain, an information theoretic measure that takes into account the complexity of the classification task.
Introduction
In terms of syntactic categories, proper nouns are lexical NPs that can be formed by primitive proper names (Adolfo_Battaglia), groups of proper nouns of different semantic categories (San_Paolo di Brescia), and also non-proper nouns (Banca dei regolamenti internazionali). In the latter case, capital letters are optional, making the problem of PN identification even more complex.
In the literature, it is accepted that an adequate treatment of proper nouns requires the use of a context-sensitive grammar (McDonald, 1996). McDonald points out that the context sensitivity requirement involves two complementary types of evidence: internal and external. The internal evidence can be derived from the sequence of words in a text (proper nouns and trigger words, such as Inc., &, Ltd., Company, etc.), and is gained in almost all state-of-the-art PN recognisers by the use of large gazetteers and lists of trigger words. The external evidence is the context of a proper noun, which provides classificatory criteria to reinforce internal evidence, if any, or supplies some classificatory evidence when internal evidence is missing. In fact, proper names form an open class, making the incompleteness of gazetteers an obvious problem. The methods for recognition of proper nouns (PNs) described in the literature closely reflect this view of the problem. PN identification typically includes:
• a gazetteer lookup, which locates simple and complex nominals identifying common PNs, such as companies, person names, locations, etc.
• a set of patterns or rules, stated in terms of part-of-speech, syntactic or lexical features (e.g. Mr. as an indicator of a PERSON entity type), orthographic features (e.g. capitalization), etc.
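The two components listed above can be illustrated with a minimal sketch. It is not the system described in the paper: the gazetteer entries, the trigger-word lists, the label names and the function `classify_proper_name` are all hypothetical, chosen only to show how a gazetteer lookup and trigger-word rules (internal evidence) might be combined, and how a name falls through as unknown when neither applies.

```python
# Hypothetical toy gazetteer: real systems use lists with many thousands of entries.
GAZETTEER = {
    "Mario Rossi": "PERSON",   # made-up person name, for illustration only
    "Brescia": "LOCATION",
}

# Hypothetical trigger words (internal evidence) found at the edges of a name.
PREFIX_TRIGGERS = {"Mr.": "PERSON", "Mrs.": "PERSON", "Dr.": "PERSON"}
SUFFIX_TRIGGERS = {"Inc.": "ORGANIZATION", "Ltd.": "ORGANIZATION",
                   "Company": "ORGANIZATION"}


def classify_proper_name(tokens):
    """Return an entity type for a detected proper name, given as a token list.

    Step 1: exact gazetteer lookup on the whole name.
    Step 2: trigger word at the start or end of the name.
    Otherwise the name is left as UNKNOWN, i.e. the case that a
    context-based method would have to handle.
    """
    name = " ".join(tokens)
    if name in GAZETTEER:
        return GAZETTEER[name]
    if tokens and tokens[0] in PREFIX_TRIGGERS:
        return PREFIX_TRIGGERS[tokens[0]]
    if tokens and tokens[-1] in SUFFIX_TRIGGERS:
        return SUFFIX_TRIGGERS[tokens[-1]]
    return "UNKNOWN"
```

Names that end up as `UNKNOWN` in such a scheme are the ones for which only external evidence, the surrounding context, can supply a classification.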
Similar resources
Tagging Unknown Proper Names Using Decision Trees
This paper describes a supervised learning method to automatically select from a set of noun phrases, embedding proper names of different semantic classes, their most distinctive features. The result of the learning process is a decision tree which classifies an unknown proper name on the basis of its context of occurrence. This classifier is used to estimate the probability distribution of an ...
Automatic Processing of Proper Names in Texts
This paper shows first the problems raised by proper names in natural language processing. Second, it introduces the knowledge representation structure we use based on conceptual graphs. Then it explains the techniques which are used to process known and unknown proper names. At last, it gives the performance of the system and the further works we intend to deal with. or unknown. Some of these ...
Discovering Lexical Information by Tagging Arabic Newspaper Text
In this paper we describe a system for building an Arabic lexicon automatically by tagging Arabic newspaper text. In this system we are using several techniques for tagging the words in the text and figuring out their types and their features. The major techniques that we are using are: finding phrases, analyzing the affixes of the words, and analyzing their patterns. Proper nouns are particular...
An Approach to Proper Name Tagging for German
This paper presents an incremental method for the tagging of proper names in German newspaper texts. The tagging is performed by the analysis of the syntactic and textual contexts of proper names together with a morphological analysis. The proper names selected by this process supply new contexts which can be used for finding new proper names, and so on. This procedure was applied to a small Ge...
Unsupervised methods for developing taxonomies by combining syntactic and statistical information
This paper describes an unsupervised algorithm for placing unknown words into a taxonomy and evaluates its accuracy on a large and varied sample of words. The algorithm works by first using a large corpus to find semantic neighbors of the unknown word, which we accomplish by combining latent semantic analysis with part-of-speech information. We then place the unknown word in the part of the tax...
Publication date: 1998